Abstract



Analysis of networks and in particular discovering communities within networks has been 
a focus of recent work in several fields, with applications ranging from citation and friendship 
networks to food webs and gene regulatory networks. Most of the existing community detection 
methods focus on partitioning the entire network into communities, with the expectation of 
many ties within communities and few ties between. However, many networks contain nodes 
that do not fit in with any of the communities, and forcing every node into a community can 
distort results. Here we propose a new framework that focuses on community extraction instead 
of partition, extracting one community at a time. The main idea behind extraction is that the
strength of a community should not depend on ties between members of other communities, 
but only on ties within that community and its ties to the outside world. We show that the 
new extraction criterion performs well on simulated and real networks, and establish asymptotic 
consistency of our method under the block model assumption. 



1 Introduction 



Understanding and modeling network structures has been a focus of attention in a number of diverse 
fields, including physics, biology, computer science, statistics, and social sciences. Applications of 
network analysis include friendship and social networks, marketing and recommender systems, the 
world wide web, disease models, and food webs, among others. A fundamental problem in the study 
of networks is community detection (see, for example, [19] for a comprehensive recent review). The 
extensive literature on the subject typically assumes that networks consist of communities, which are 
thought of as tightly-knit groups with many connections between the group members and relatively 
few connections between groups. We focus here on undirected networks N = (V,E), where V is 
the set of nodes and E is the set of edges, possibly weighted. The community detection problem is 
typically formulated as finding a partition $V = V_1 \cup \cdots \cup V_K$ which gives "tight" communities in some
suitable sense (several examples will be discussed below). The node sets $V_1, \ldots, V_K$ are typically
taken to be disjoint, although there is some recent work on detecting overlapping communities 
[231 [U [26]. Whether communities are overlapping or not, every node is required to belong to at 
least one community. There are many examples of networks where such a requirement makes sense,
for example, the college football games network [8], and yet some commonly studied networks clearly
do not fit this framework. For example, in the high school friendship network of [15] discussed later 
in the paper, there are people who belong to tight communities, but there are also people who do 
not have ties to any community at all. There is surprisingly little work allowing for such a network 
structure. 

In this paper, we propose broadening the framework of community detection to allow for a network 
to contain not only communities, but also "background" nodes that are not required to have tight 
connections to anything else in the network. We approach the problem via sequential community 
extraction rather than network partition: at each step, we extract the tightest (in a certain sense 
to be defined) community with the sparsest connections to the rest of the network. The process 
is repeated for the remaining nodes until no more meaningful communities can be extracted; the 
remainder of the network is then classified as background. Community extraction can also be used 
in conjunction with partition, for example, to identify the cores of communities found by a partition 
algorithm. The key characteristic of our approach is that we focus on edges within the candidate 
community to be extracted and edges connecting it to the rest of the network, and ignore edges 
within the rest of the network. The intuition here is that edges not related to nodes in a potential 
community should not influence our judgment on the community in question. A by-product of this 
process is a ranking of extracted communities, which standard partitioning methods do not provide. 
Extraction may also be more robust to changes in the network over time, since the definition of a 
community does not rely on links between unrelated nodes. 

To illustrate the motivation for our approach, consider the following toy example: out of n = 60 
nodes, 15 belong to a community where links between members form independently with probability 
0.5 each. The links from members to the other 45 nodes and links between the other 45 nodes all 
form independently with probability 0.1. Results of partition (using the modularity method of [20])
into two communities and community extraction by our method are shown in Figure 1. Partitioning
has to balance tightness of the two communities, and as a result includes a number of background 
nodes in the community. Extraction, on the other hand, separates the community out perfectly. 

Figure 1: Toy example: shapes represent the truth and colors represent partition results using
modularity (left) and extraction results using our method (right). 



Most popular community detection methods focus on maximizing links within communities while 
minimizing links between communities. This can be achieved either implicitly through an algorithmic
approach (see [17] for a review) or explicitly by optimizing a criterion that measures quality
of a proposed partition over all possible partitions. The criteria proposed include ratio cuts [27],
normalized cuts [23], spectral clustering [22], and modularity [20], [19]. All of these methods are designed
to partition networks that consist of pure communities, with no background. Another class
of methods assumes a parametric statistical model for generating the network and estimates the 
partitioning by maximizing the likelihood or by employing a Bayesian method. The models used for 
partitioning include the block model [25], [3], a mixture model [21], univariate [13] and multivariate
[12] latent variable models; for a comprehensive review of statistical models of networks, see [10].
While each statistical model has its advantages and often aids in interpretation of the data, it also 
imposes its own constraints and assumptions, in contrast to graph partitioning methods that do not 
typically depend on model assumptions (although in practice they may work well for some models 
and fail for others). Our method is also based on a model-free non-parametric criterion, and we
will show that it is consistent under one of the commonly used statistical models, the stochastic 
block model. Finally, another line of work related to ours is the core-periphery partition methods 
[U [7]. These authors use a different criterion to separate a tight "core" from a sparse "periphery",
whereas our criterion is designed to extract the tightest community regardless of whether the rest 
of the network is sparse or contains other communities. 

In the remainder of the paper, we present the new methodology for community extraction, show 
that it is asymptotically consistent under the stochastic block model, and apply the method to a 
number of simulated and real networks, comparing extraction and partition results. The proofs of 
theorems are given in the Supplementary Material. 



2 The community extraction methodology 



To focus ideas, we start by discussing several related partitioning methods. An undirected
network N = (V, E) with |V| = n nodes can be represented by an n x n adjacency matrix $A = [A_{ij}]$,
where $A_{ij} > 0$ if there is an edge between nodes i and j and $A_{ij} = 0$ otherwise. If the network has
weights associated with edges, the positive $A_{ij}$'s are the weights; if not, the positive $A_{ij}$'s are set
to 1. Since we focus on undirected networks, A is symmetric. For simplicity, we focus on partition
into two sets $(V_1, V_2)$, where $V_1 \cap V_2 = \emptyset$ and $V = V_1 \cup V_2$. A partition is associated with a partition
vector s, where $s_i = 1$ if node i belongs to $V_1$, and $s_i = -1$ if node i belongs to $V_2$.

A naive way to partition a network is to minimize the total weight R of edges connecting $V_1$ and
$V_2$ over all possible partitions (the min-cut method). The total weight, or the cut, is given by

$$R = \sum_{i \in V_1,\, j \in V_2} A_{ij}. \qquad (1)$$

However, minimizing R yields a trivial solution of $V_1 = V$. The ratio cut approach [27] avoids the
trivial solution by minimizing $R/(|V_1| \cdot |V_2|)$, where $|V_1|$ and $|V_2|$ are the sizes of the two groups.
Efficient spectral algorithms for computing the ratio cut are available [11]. Another approach is
minimizing the normalized cut [23], initially proposed for image segmentation. The normalized cut
is defined as $R/D_1 + R/D_2$, where $D_k = \sum_{i \in V_k,\, j \in V} A_{ij}$ for k = 1, 2 is the total number of edges
involving nodes in $V_k$. Dividing by $D_k$ encourages balanced group sizes and avoids trivial solutions.
The normalized cut criterion can be approximated by a generalized eigenvalue problem and thus
solved efficiently.
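
To make the three cut criteria concrete, the following minimal Python sketch evaluates them for a given two-way split. It is an illustration only, not the implementation used for the results in this paper; the helper names (`adjacency`, `cut_weight`, `ratio_cut`, `normalized_cut`) and the toy graph of two triangles joined by one edge are assumptions made for the example.

```python
def adjacency(n, edges):
    """Build a symmetric 0/1 adjacency matrix from an undirected edge list."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1
    return A


def cut_weight(A, V1):
    """Cut R: total weight of edges between V1 and its complement."""
    V1 = set(V1)
    return sum(A[i][j] for i in V1 for j in range(len(A)) if j not in V1)


def ratio_cut(A, V1):
    """R / (|V1| * |V2|); V1 must be a non-empty proper subset."""
    return cut_weight(A, V1) / (len(V1) * (len(A) - len(V1)))


def normalized_cut(A, V1):
    """R / D1 + R / D2, where Dk sums the degrees of the nodes in Vk."""
    V1 = set(V1)
    R = cut_weight(A, V1)
    D1 = sum(sum(A[i]) for i in V1)
    D2 = sum(sum(A[i]) for i in range(len(A)) if i not in V1)
    return R / D1 + R / D2


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
TRIANGLES = adjacency(6, [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)])
balanced = [0, 1, 2]   # one triangle
single = [0]           # a single node split off
```

On this graph the plain cut is already smallest for the balanced split (1 versus 2), and both normalized versions widen the gap: the ratio cut is 1/9 versus 2/5, and the normalized cut is 2/7 versus 7/6.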

In the context of community detection in networks, perhaps the most popular criterion for partitioning
is modularity [20]. The intuition behind the modularity criterion is to compare the observed
number of edges within groups to the expected number under the configuration model of [6]. Under
this model, an edge between nodes i and j is created independently of other edges with probability
$P_{ij} = k_i k_j / 2m$, where $k_i = \sum_j A_{ij}$ is the degree of node i and $2m = \sum_i k_i$ is twice the
number of edges in the network. The modularity criterion is then defined as

$$Q = \frac{1}{4m} \sum_{ij} \left[ A_{ij} - P_{ij} \right] s_i s_j, \qquad (2)$$

where the multiplier 1/4m is a matter of convention. Like the normalized cut, the modularity
solution can be approximated using the eigen-decomposition of the corresponding modularity matrix
$A_{ij} - P_{ij}$ [T5], and thus computed efficiently. In fact, in practice the normalized cut and the
modularity solutions tend to be very similar, even though they are motivated by quite different 
considerations. 
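
As an illustration of the modularity criterion for a two-way split, the sketch below evaluates Q directly from its definition as a double sum. It is a didactic example rather than an efficient or authoritative implementation; the function names and the toy graph (two triangles joined by one edge) are assumptions made here.

```python
def adjacency(n, edges):
    """Build a symmetric 0/1 adjacency matrix from an undirected edge list."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1
    return A


def modularity(A, s):
    """Two-way modularity: (1/4m) * sum_ij (A_ij - k_i k_j / 2m) s_i s_j.

    s is the partition vector with entries +1 or -1.
    """
    n = len(A)
    k = [sum(row) for row in A]          # node degrees
    two_m = sum(k)                       # twice the number of edges
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += (A[i][j] - k[i] * k[j] / two_m) * s[i] * s[j]
    return total / (2 * two_m)           # 2 * (2m) = 4m


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
TRIANGLES = adjacency(6, [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)])
q_split = modularity(TRIANGLES, [1, 1, 1, -1, -1, -1])   # the two triangles
q_trivial = modularity(TRIANGLES, [1, 1, 1, 1, 1, 1])    # no split at all
```

Splitting into the two triangles gives Q = 5/14, while the trivial assignment of all nodes to one group gives Q = 0, since the observed and expected edge counts then coincide.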

For simplicity, we have given all the criteria above for partitioning into two communities; the 
normalized cut and modularity can easily be adjusted to partition into K communities. Typically, 
however, K is unknown and estimating K is an open problem. Another option, more common in 
practice, is to proceed sequentially: split the network into two groups, then split each group further, 
and so on. This greedy method may miss the optimal partition. It also requires a stopping criterion; 
for modularity, a natural rule is to stop when a proposed split decreases the overall modularity [IS] . 

Regardless of the criterion used, partitioning methods are not designed to deal with the situation 
when background nodes are present, without tight links to any part of the network. Typically, 
such nodes will be split and grouped together with tighter communities present in the network, 
rather than separated out in a class of their own. This is mainly because partitioning methods are 
symmetric - the sets $V_1$ and $V_2$ can be interchanged in any of the above criteria. However, if the
goal is to separate a tight community from a sparse background, the roles of $V_1$ and $V_2$ cannot be
the same - in fact, they have to be the opposite. This observation lies at the core of our proposed 
methodology. 



2.1 The community extraction criterion 

The criterion we propose extracts one community at a time by looking for a set with a large number 
of links within itself and a small number of links to the rest of the network. The links within the 
complement of this set do not matter, and thus the remainder can contain background nodes and/or 
other communities. To emphasize the lack of symmetry in the criterion, we denote the community 
to be extracted by S and its complement by $S^c$ (rather than $V_1$ and $V_2$). Then we maximize the
following extraction criterion over all possible S: 

$$W(S) = \frac{O(S)}{|S|^2} - \frac{B(S)}{|S|\,|S^c|}, \qquad (3)$$

where

$$O(S) = \sum_{i,j \in S} A_{ij}, \qquad B(S) = \sum_{i \in S,\, j \in S^c} A_{ij}.$$

The term O(S) is twice the number of the edges within S, and B(S) counts connections between
S and the rest of the network. Each term is normalized by the total number of possible edges in
each case, which gives these quantities a natural interpretation as probability estimates under the
stochastic block model, discussed further below. Note that we ignore the issue of self-loops and
normalize the first term by $|S|^2$ rather than $|S|(|S| - 1)$; in practice this makes little difference.
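
The criterion is simple to evaluate for a candidate set. The sketch below is a minimal illustration, with the function names and the six-node toy graph assumed for the example. It also demonstrates the defining property of extraction: edges inside the complement of S do not affect the value.

```python
def adjacency(n, edges):
    """Build a symmetric 0/1 adjacency matrix from an undirected edge list."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1
    return A


def extraction_criterion(A, S):
    """W(S) = O(S)/|S|^2 - B(S)/(|S| |S^c|) for a non-empty proper subset S."""
    S = set(S)
    Sc = [v for v in range(len(A)) if v not in S]
    if not S or not Sc:
        raise ValueError("S must be a non-empty proper subset of the nodes")
    O = sum(A[i][j] for i in S for j in S)    # twice the edges within S
    B = sum(A[i][j] for i in S for j in Sc)   # edges from S to the rest
    return O / len(S) ** 2 - B / (len(S) * len(Sc))


# Hypothetical toy graph: a triangle {0, 1, 2}, one edge from node 2 to node 3,
# and one edge between nodes 4 and 5 that lies entirely outside the triangle.
core_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
with_outside_edge = adjacency(6, core_edges + [(4, 5)])
without_outside_edge = adjacency(6, core_edges)

w_with = extraction_criterion(with_outside_edge, [0, 1, 2])
w_without = extraction_criterion(without_outside_edge, [0, 1, 2])
```

Both values equal 6/9 - 1/9 = 5/9: removing the edge between nodes 4 and 5 changes nothing, because that edge touches neither S nor its boundary.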

One drawback of criterion (3) is that, like the original graph cut, it does not explicitly guard against
splitting off small communities. The trivial solution does not maximize W, but nonetheless, in a
large sparse network a very small community can give a high value of W, since the second term
will be made negligible by the large $|S^c|$ in the denominator. To avoid this situation, we can use
an adjustment in the spirit of the ratio cut, and maximize the following criterion instead: 



$$W_a(S) = |S|\,|S^c| \left[ \frac{O(S)}{|S|^2} - \frac{B(S)}{|S|\,|S^c|} \right]. \qquad (4)$$

Since $|S|\,|S^c|$ is maximized at |S| = n/2, this factor penalizes very small and very large communities
and produces more balanced solutions. Empirically, we found that the adjustment makes a difference
in sparse networks, but plays no role in dense networks. Later we show that asymptotically
both criteria are consistent.
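
A small sparse example shows how the size factor can change which set is preferred. The sketch below is illustrative only; the function names and the ten-node graph (an isolated pair of linked nodes, a five-node cycle, and three isolated nodes) are assumptions chosen so that the two criteria disagree.

```python
def adjacency(n, edges):
    """Build a symmetric 0/1 adjacency matrix from an undirected edge list."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1
    return A


def original_criterion(A, S):
    """W(S) = O(S)/|S|^2 - B(S)/(|S| |S^c|)."""
    S = set(S)
    Sc = [v for v in range(len(A)) if v not in S]
    O = sum(A[i][j] for i in S for j in S)
    B = sum(A[i][j] for i in S for j in Sc)
    return O / len(S) ** 2 - B / (len(S) * len(Sc))


def adjusted_criterion(A, S):
    """W_a(S) = |S| |S^c| W(S)."""
    size = len(set(S))
    return size * (len(A) - size) * original_criterion(A, S)


# Hypothetical sparse graph on 10 nodes: edge (0, 1), a 5-cycle on nodes 2..6,
# and isolated nodes 7, 8, 9.
SPARSE = adjacency(10, [(0, 1), (2, 3), (3, 4), (4, 5), (5, 6), (6, 2)])
pair = [0, 1]
cycle = [2, 3, 4, 5, 6]

w_pair, w_cycle = original_criterion(SPARSE, pair), original_criterion(SPARSE, cycle)
wa_pair, wa_cycle = adjusted_criterion(SPARSE, pair), adjusted_criterion(SPARSE, cycle)
```

Here W prefers the two-node set (0.5 versus 0.4), whereas the adjusted criterion prefers the five-node cycle (10 versus 8).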

Our community extraction procedure consists of sequentially applying the extraction criterion: we extract a
community and apply the extraction again to its complement. If there is prior information on the 
correct or desired number of communities to be extracted, we stop after this number has been 
obtained and declare the rest to be background. In the absence of such information, we continue 
until we cannot find a community bigger than a certain preset size (for the examples in this paper, 
5 nodes), and classify the rest as background. This procedure is greedy, but it differs from the 
usual sequential graph partitioning in that we never split a community further once it has been 
extracted, and thus we are not in danger of producing a solution consisting of a large number of 
small communities. 
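
The sequential procedure can be sketched as follows. This is a hedged illustration, not the implementation behind the results reported here: it uses the adjusted criterion, replaces the tabu search with exhaustive enumeration of subsets (feasible only for very small graphs), and stops when the best set found has fewer than `min_size` nodes. All function names, the `min_size` parameter, and the toy graph are assumptions made for the example.

```python
from itertools import combinations


def adjacency(n, edges):
    """Build a symmetric 0/1 adjacency matrix from an undirected edge list."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1
    return A


def adjusted_criterion(A, S, nodes):
    """W_a(S) evaluated within the subgraph induced by `nodes`."""
    S = set(S)
    Sc = [v for v in nodes if v not in S]
    O = sum(A[i][j] for i in S for j in S)
    B = sum(A[i][j] for i in S for j in Sc)
    return len(S) * len(Sc) * (O / len(S) ** 2 - B / (len(S) * len(Sc)))


def best_community(A, nodes):
    """Exhaustive maximizer over non-empty proper subsets (tiny graphs only)."""
    best, best_val = None, float("-inf")
    for size in range(1, len(nodes)):
        for S in combinations(nodes, size):
            val = adjusted_criterion(A, S, nodes)
            if val > best_val:           # strict: the first maximizer is kept
                best, best_val = list(S), val
    return best


def sequential_extraction(A, min_size=3):
    """Extract communities one at a time; what is left is background."""
    remaining = list(range(len(A)))
    communities = []
    while len(remaining) >= 2:
        S = best_community(A, remaining)
        if len(S) < min_size:            # nothing meaningful left to extract
            break
        communities.append(S)
        remaining = [v for v in remaining if v not in S]
    return communities, remaining


# Hypothetical toy graph: two disjoint triangles plus two isolated nodes.
TOY = adjacency(8, [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)])
communities, background = sequential_extraction(TOY, min_size=3)
```

On this graph the procedure extracts the two triangles in turn and leaves nodes 6 and 7 as background. An extracted community is never split again, in line with the description above.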



2.2 Maximizing the extraction criterion 

Finding the exact global maximum of the extraction criterion is NP-hard. Here we use a local 
optimization technique based on label switching known as tabu search [21 E]. The key idea of tabu 
search is that once a node label has been switched, it cannot be switched again for the next T 
iterations (the node has "tabu" status). This guards against being trapped in a local maximum. 
The algorithm starts from an initial value and examines all current non-tabu nodes in order. If the 
current value of the global maximum can be improved, the node label is switched, its status changed 
to tabu, and the algorithm returns to node 1. If no node can be switched to improve the global 
maximum, the node that gives the largest increase in the current criterion value is switched, and if 
no increase is possible, the node that gives the smallest decrease is switched. The algorithm is run 
for a prescribed number of iterations, and the best solution seen in the course of these iterations is
taken to be the final solution (not necessarily the one from the last iteration). Note that the value
of the criterion (3) can be updated efficiently in O(n) operations for a single label switch. Finally,
since the algorithm depends on the order of nodes as well as on the initial value, we run it for a 
number of random starting values and random orders of nodes. 
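
The flavor of the search can be conveyed by a deliberately simplified sketch. It differs from the algorithm described above in several labeled ways: at every iteration it switches the non-tabu node whose switch gives the best criterion value (even if that value is worse than the current one) instead of scanning nodes in order, it maximizes the adjusted criterion, and it recomputes the criterion from scratch rather than using the O(n) update. The names `adjusted_value`, `tabu_search`, `tenure`, and `n_iter` are assumptions made for this example.

```python
import random


def adjusted_value(A, in_S):
    """Adjusted criterion for S = {i : in_S[i]}; -inf for the trivial sets."""
    n = len(A)
    size = sum(in_S)
    if size == 0 or size == n:
        return float("-inf")
    O = sum(A[i][j] for i in range(n) for j in range(n) if in_S[i] and in_S[j])
    B = sum(A[i][j] for i in range(n) for j in range(n) if in_S[i] and not in_S[j])
    return (n - size) * O / size - B     # equals |S||S^c| (O/|S|^2 - B/(|S||S^c|))


def value_after_switch(A, in_S, v):
    """Criterion value if the label of node v were switched."""
    in_S[v] = not in_S[v]
    val = adjusted_value(A, in_S)
    in_S[v] = not in_S[v]
    return val


def tabu_search(A, n_iter=200, tenure=3, seed=0):
    """Simplified tabu search; returns the best set seen and its value."""
    rng = random.Random(seed)
    n = len(A)
    current = [rng.random() < 0.5 for _ in range(n)]
    if all(current) or not any(current):
        current[0] = not current[0]      # start from a non-trivial set
    best, best_val = current[:], adjusted_value(A, current)
    tabu_until = [0] * n                 # node v is tabu while tabu_until[v] >= it
    for it in range(1, n_iter + 1):
        candidates = [v for v in range(n) if tabu_until[v] < it]
        if not candidates:
            continue
        v = max(candidates, key=lambda u: value_after_switch(A, current, u))
        current[v] = not current[v]
        tabu_until[v] = it + tenure      # no switching back for `tenure` iterations
        val = adjusted_value(A, current)
        if val > best_val:
            best, best_val = current[:], val
    return [i for i in range(n) if best[i]], best_val


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
EDGES = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = [[0] * 6 for _ in range(6)]
for i, j in EDGES:
    A[i][j] = A[j][i] = 1
S_best, val_best = tabu_search(A, n_iter=50, tenure=2, seed=0)
```

In practice one would call `tabu_search` with several values of `seed` and keep the best result, mirroring the use of multiple random starting values.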



3 Asymptotic Consistency 



Our algorithm does not explicitly rely on a stochastic model for the network, but if we assume the 
network has been generated by a block model [14], [25], we can establish asymptotic consistency of
the extraction criterion using the recent results of [3]. The general block model assumes that nodes 
belong to one of K blocks, and the network is generated as follows. First, each node is assigned to 
a block independently of other nodes, with $P(c_i = k) = \pi_k$, $1 \le k \le K$, $\sum_{k=1}^{K} \pi_k = 1$, where
$c = (c_1, \ldots, c_n)$ is the n x 1 vector of labels representing node assignments to blocks. Then, conditional
on c, edges are generated independently with probabilities $P[A_{ij} = 1 \mid c_i = a, c_j = b] = p_{ab}$. The
vector of probabilities $\pi = \{\pi_1, \ldots, \pi_K\}$ and the K x K symmetric matrix $P = [p_{ab}]_{1 \le a,b \le K}$ together
specify a block model. We can stipulate the presence of background by requiring, for instance, that
$p_{aK} < p_{bb}$ for all a = 1, ..., K, all b = 1, ..., K - 1. One could further assume that $p_{aK} = p$ for all
a = 1, ..., K, but this assumption is not necessary for the theory to be developed.
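
The two-stage generative description translates directly into a sampler. The following is a minimal sketch using only the Python standard library; the function name `sample_block_model` and its arguments are assumptions made for this example, and blocks are indexed from 0 rather than 1.

```python
import random


def sample_block_model(n, pi, P, seed=None):
    """Draw labels c and an undirected 0/1 adjacency matrix A from a block model.

    pi : block probabilities (length K, summing to 1)
    P  : K x K symmetric matrix of edge probabilities p_ab
    """
    rng = random.Random(seed)
    K = len(pi)
    # Step 1: assign each node to a block independently with probabilities pi.
    c = rng.choices(range(K), weights=pi, k=n)
    # Step 2: given c, draw each edge independently with probability p_{c_i c_j}.
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):        # no self-loops; fill both triangles
            if rng.random() < P[c[i]][c[j]]:
                A[i][j] = A[j][i] = 1
    return c, A


# One community (block 0) plus background (block 1): the last row and column
# of P hold the background probability, below the within-community value.
labels, network = sample_block_model(
    n=60, pi=[0.25, 0.75], P=[[0.5, 0.1], [0.1, 0.1]], seed=1
)

# Degenerate check case: edges appear exactly within blocks.
c_det, A_det = sample_block_model(20, [0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]], seed=1)
```

The parameter values in the first call echo the toy example from the introduction, except that block membership is random here rather than fixed at 15 of 60 nodes.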

We consider asymptotic consistency of label assignments by the extraction method as the number 
of nodes $n \to \infty$. If P does not change with n, the network will become very dense as n grows, so
we allow $P_n$ to depend on n. A natural parametrization is $P_n = \rho_n P$, where $\rho_n = P[A_{ij} = 1] \to 0$
is the probability of an edge between arbitrary nodes i and j. The expected node degree $\lambda_n = n\rho_n$
becomes the natural parameter to control as $n \to \infty$.

Bickel and Chen [3] developed a general framework for checking whether a community-finding 
criterion can correctly recover the true node labels as $n \to \infty$, under the block model assumption.
The details of their conditions for asymptotic consistency are given in the Supplementary Materials. 
Briefly, the main condition is that the proposed criterion is maximized by the true label assignment 
when all the sample quantities in the criterion are replaced by their population equivalents, which 
can be viewed as a special case of the general theory of minimum contrast estimation [1]. 

Since we perform community extraction sequentially, we focus on checking consistency for the case 
K = 2 (one extracted community plus the rest of the network). The matrix P is 2 x 2 with three 
unique parameters $p_{11}$, $p_{22}$, $p_{12}$, and the vector of class probabilities $\{\pi, 1 - \pi\}$ is determined by
the single parameter $\pi$. Theorem 1 gives consistency of criterion (3).

Theorem 1. Suppose $\lambda_n / \log n \to \infty$. Then for any $0 < \pi < 1$, if $p_{11} > p_{12}$, $p_{11} > p_{22}$ and
$p_{11} + p_{22} > 2p_{12}$, the maximizer $\hat{c}^{(n)}$ of criterion (3) satisfies

$$P[\hat{c}^{(n)} = c] \to 1 \quad \text{as } n \to \infty,$$

where c are the true labels.

Note that the simplest case of one community with background nodes connecting at random to all 
nodes in the network ($p_{12} = p_{22} = p$) is covered by the theorem as long as $p_{11} > p$.
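
For this simplest case, substituting the common background value p for both $p_{12}$ and $p_{22}$ shows that the three conditions of Theorem 1 collapse to a single inequality:

```latex
\begin{align*}
p_{11} > p_{12} &\iff p_{11} > p, \\
p_{11} > p_{22} &\iff p_{11} > p, \\
p_{11} + p_{22} > 2p_{12} &\iff p_{11} + p > 2p \iff p_{11} > p.
\end{align*}
```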

The adjusted criterion (4) differs from the original criterion (3) by a factor of $|S|\,|S^c|$, but it turns
out that this factor has no effect in the limit, and asymptotic consistency holds for the adjusted 
criterion as well. 

Theorem 2. (Adjusted criterion) Suppose $\lambda_n / \log n \to \infty$. Then for any $0 < \pi < 1$, if $p_{11} > p_{12}$, $p_{11} > p_{22}$
and $p_{11} + p_{22} > 2p_{12}$, the maximizer $\hat{c}^{(n)}$ of criterion (4) satisfies

$$P[\hat{c}^{(n)} = c] \to 1 \quad \text{as } n \to \infty.$$

Proofs of both theorems are given in the Supplementary Materials.

4 Numerical evaluation 



In this section, we compare performance of our original extraction criterion, adjusted extraction 
criterion and modularity for three different simulated scenarios. We compare the methods using the 
positive predictive value (PPV) and the negative predictive value (NPV), defined as follows. Let S 
be the extracted community, and let $C_S$ be the true community that matches S best, determined
by majority vote. Then we define

$$\mathrm{PPV} = \frac{|C_S \cap S|}{|S|}, \qquad \mathrm{NPV} = 1 - \frac{|C_S \cap S^c|}{|S^c|}.$$

The PPV is a measure of purity of the extracted community, and the NPV is a measure of completeness.
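
The two measures can be computed as in the following minimal sketch. The function name `ppv_npv` and the example labels are assumptions made for illustration; ties in the majority vote are broken arbitrarily by `Counter.most_common`.

```python
from collections import Counter


def ppv_npv(S, true_labels):
    """Return (PPV, NPV) for an extracted set S given the true labels.

    C_S is the true community that is most common among the members of S.
    S must be a non-empty proper subset of the nodes.
    """
    n = len(true_labels)
    S = set(S)
    Sc = set(range(n)) - S
    majority_label = Counter(true_labels[i] for i in S).most_common(1)[0][0]
    C = {i for i in range(n) if true_labels[i] == majority_label}
    ppv = len(C & S) / len(S)            # share of S that belongs to C_S
    npv = 1 - len(C & Sc) / len(Sc)      # 1 - share of S^c that belongs to C_S
    return ppv, npv


# Hypothetical example: nodes 0-3 form the true community, 4-9 are background.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
ppv, npv = ppv_npv([0, 1, 2, 4], truth)
```

In this example three of the four extracted nodes belong to the true community, so PPV = 3/4, and one community member out of six nodes was left in the complement, so NPV = 5/6.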

First we consider the case of two communities with no background, to check that our method 
works in this standard situation. Specifically, we generate networks with 1000 nodes from a block 
model with K = 2, $p_{11} = 0.5$, $p_{22} = 0.4$, and $p_{12} = 0.05$, with fixed first community size $n_1$. The
number of replications in this simulation and all following is fixed at 50. Table 1 shows the means
and standard deviations of PPV and NPV over 50 replications for $n_1 = 100$; for more balanced
community sizes ($n_1 = 200$ and larger) all methods find the ideal partition, and these results are
not shown. 

Table 1: Results for two communities with no background: mean(SD) of positive and negative 
predictive values over 50 replications. 

        Modularity    Original      Adjusted
PPV     1(0)          1(0)          1(0)
NPV     0.84(0.03)    1.00(0.00)    0.71(0.06)



All methods do perfectly in terms of PPV, meaning all nodes in the extracted community belong to 
the same true class. The original criterion gives the best performance on NPV (meaning no nodes 
from the true community were "lost" in extraction), since this network is dense, and the problem 
with splitting off small clusters does not occur. The adjusted criterion performs worse (although for
larger $n_1$ all methods perform the same); however, the opposite phenomenon occurs for relatively
sparse networks, which is illustrated in the next example. 

In the second simulation, we consider sparse networks with one community and background, generated
from the block model. Again we fix the number of nodes at 1000, and let $p_{12} = p_{22} = 0.05$.
We consider three community sizes ($n_1$ = 100, 200, 300), and three values of $p_{11}$ = 0.1, 0.15, and
0.2. In the third simulation, we consider a more complicated situation with two communities with
similar densities and a sparse background. The network size is fixed at 1000 and the probability
that a background node has a link to any other node is 0.05. We consider two different sizes for
both communities (100 and 300), and three values of $p_{11} = 0.05x$, $p_{22} = 0.04x$, for x = 2, 3, 4.
Here we compare the results of extracting one community with partitioning into two communities 
using modularity. 

The results for the second and third simulations are presented in Figures 2 and 3, respectively,
with boxplots of the 50 replication values of NPV and PPV shown side by side for all methods.
For the case of weakly connected communities ($p_{11} = 0.1$), the original criterion tends to favor
small clusters, which is evident from the NPV values. On the other hand, the adjusted criterion
performs best overall - even though the NPVs of the adjusted criterion are slightly lower than those
of modularity, the PPVs are significantly higher, particularly for weak signals (small $n_1$ and low
$p_{11}$). This is because the background nodes are roughly equally split between the two parts found
by modularity. 



Figure 2: Boxplots of PPV and NPV for one community with background, with panels for $p_{11}$ = 0.1, 0.15, and 0.2. M: Modularity; O:
Original extraction criterion; A: Adjusted extraction criterion.

Figure 3: Boxplots of PPVs and NPVs for two communities with background. M: Modularity; O:
Original extraction criterion; A: Adjusted extraction criterion.



5 Examples 



In this section, we apply our extraction method (using the adjusted criterion) to several real-world 
networks and compare results with partition as computed by modularity. We perform extraction 
sequentially, and each time use ten different starting values for the tabu search. We stop the 
extracting procedure when the proposed community has fewer than five nodes. 



5.1 The karate club network 



Our first example is a well-known friendship network representing friendships between 34 members 
of a karate club [28]. This club subsequently split into two parts following a disagreement
between an instructor (node 0) and an administrator (node 33), and these two groups are used as
the "ground truth" in benchmark studies of community detection algorithms. Modularity partitions
this network into exactly the true factions [19]. The extraction approach can be used to supplement
this division with more information by identifying the "cores" of each faction. Extraction found
three groups - the cores of two factions, which contain the instructor and the administrator (shown
in green and red in Figure 4), and a small tight community within the instructor's faction (shown
in yellow in Figure 4). Note that there are no extracted communities that mix the members of the
two factions.

Figure 4: Results for the karate club network: (a) partition; (b) extraction.



5.2 The political books network 

The nodes in this network [18] are 105 recent political books with links representing pairs of books
reported by Amazon as "frequently bought together". Following [18], we show the modularity solution
with node colors representing the components of the leading eigenvector of the modularity
matrix in Figure 5(a). These values result from relaxing the labels from ±1 to real-valued, and
the modularity partition is computed from the signs of these values (represented by node shapes).
The node colors can be interpreted to represent the book's position on the political spectrum, with
blue being the most liberal and red the most conservative [18]. Figure 5 shows that in addition to
a few clear "red" and "blue" books, many nodes are in fact "purple", and may not clearly belong
to either the left or the right. From the colors alone (i.e., component magnitudes), it is not clear
how to separate out the "blue" and the "red" from the more centrist "purple", whereas community
extraction can do this easily. Figure 5(b) shows the first two extracted communities which clearly
correspond to the cores of the left and the right. Further communities can be extracted, which we 
do not discuss here for lack of space. 



5.3 The school friendship network 

This dataset is a school friendship network compiled from the National Longitudinal Study of 
Adolescent Health (see [15] for more information). The survey asked students in grades 7 through
12 from 127 schools to name their close friends and answer a few questions to measure the strength 
of their friendships. Based on the answers, researchers constructed networks for each school with 
a weight 1-6 on each (directed) edge representing the strength of the friendship. Here we analyze 
the friendship network of school 1 from this dataset, converting the data to an undirected network 
by averaging the weights on the two edges connecting each pair of nodes. The resulting network
with 71 nodes is shown in Figure 6(a), with colors representing grades. We show the results of
extraction with six groups (to match the number of grades) in Figure 6(c), and with seven groups,
the number suggested by our stopping criterion, in Figure 6(d). The seventh group picks up the
grade shown in orange, which has only four nodes.

Figure 5: Results for the political books network: (a) partition; (b) extraction. Node colors in (a) represent the components of
the principal eigenvector of the modularity matrix, and node shapes represent the partition.

For comparison, we partitioned the network into six communities by using the sequential procedure 
suggested in [19], which partitions all communities from the previous step into two, and then the
partition yielding the largest modularity is chosen to perform the next split. The modularity results
shown in Figure 6(b) are noticeably different from the grades themselves and from the extracted
communities: a large part of the yellow grade is merged with green, and the smallest orange grade is 
split into three different groups. This is partly a result of the greedy sequential splitting procedure, 
but also of the fact that modularity is forced to assign every single node to a community, with 
the extreme example being the two nodes that have no links at all. The ability of our method to 
extract communities rather than partition into communities not only allows us to handle these and 
other weakly connected nodes correctly, but also appears to lead to more meaningful groups in this 
example. 



6 Summary and Discussion 



We have proposed a new framework for analysis of social networks, which extracts tight communities 
out of the network and allows for background nodes. In the examples we considered, it offers 
an additional insight into the network structure, and can be used as either an alternative or a 
complement to network partitioning. While we have obtained good results with the tabu search, it 
may be beneficial to formulate the extraction criterion as an eigenvalue problem, which is work in 
progress. More work is also needed on the stopping criterion and determining the correct number 
of communities; this question, however, is common to all community detection methods. Finally, 
assessing the quality of a proposed extraction/partition is an open problem; one solution based 
on robustness to permutations was proposed in [16], but assessing statistical significance under
appropriate null models, and formulating such null models, is also of interest.

Figure 6: Results for the school friendship network: (a) the network, with colors representing grades; (b) partition with 6 groups; (c) extracting 6 groups; (d) extracting 7 groups.



Appendix: Proofs of Theorems 



Here we first state a simpler version of the main theorem of Bickel and Chen [3] sufficient for our 
purposes, and then apply it to prove Theorems 1 and 2. The theorem holds for a general K but to 
simplify notation we only state it for K = 2. We start by introducing additional notation. Let
O be a 2 x 2 matrix defined by

$$O_{kl}(s, A) = \sum_{1 \le i,j \le n} A_{ij}\, I(s_i = k, s_j = l).$$

Evidently, $O_{kk}$ is twice the number of edges among nodes in the k-th community and $O_{kl}$ is the
number of edges between the k-th and l-th communities. Let R be the confusion matrix,

$$R_{ab}(s, c) = \frac{1}{n} \sum_{i=1}^{n} I(s_i = a, c_i = b),$$

where s is a proposed label assignment and c is the vector of true labels. Finally, let $f(c) = R^T \mathbf{1}$
and $f(s) = R\mathbf{1}$ be the proportion of nodes in each block for the assignments c and s, respectively.

Letting $\mu_n = n^2 \rho_n$, we can write the extraction criterion (3) (up to a multiplicative factor) in the
form

$$Q(s, A) = F\!\left(\frac{O(s, A)}{\mu_n},\, f(s)\right).$$

Further, it is easy to verify that

$$\frac{1}{\mu_n} E\big(O(s, A) \mid c\big) = R(s, c)\, P\, R^T(s, c).$$

Thus the population version of Q is $F(RPR^T, R\mathbf{1})$. Then a natural necessary condition for asymptotic
consistency of the criterion is that its population version is maximized by the correct diagonal
confusion matrix, which gives us condition (C1) in the following theorem by Bickel and Chen:

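Theorem A1. Suppose F, P and $\pi$ satisfy the following conditions:

(C1) $F(RPR^T, R\mathbf{1})$ is uniquely maximized over $\mathcal{R} = \{R : R \ge 0,\ R^T\mathbf{1} = (\pi, 1 - \pi)^T\}$ by
$R = D(\pi) = \mathrm{diag}(\pi, 1 - \pi)$, for all $(\pi, P)$ in an open set $\Theta$.

(C2) P has no identical columns.

(C3) (a) F is Lipschitz in its arguments; (b) Let $W = D(\pi) P D(\pi)$. The directional derivatives
$\frac{\partial^2 F}{\partial \epsilon^2}(M_0 + \epsilon(M_1 - M_0),\, t_0 + \epsilon(t_1 - t_0))\big|_{\epsilon = 0+}$ are continuous in $(M_1, t_1)$ for all $(M_0, t_0)$ in
a neighborhood of $(W, C(\pi))$, where $C(\pi) = (\pi, 1 - \pi)^T$; (c) Let $G(R, P) = F(RPR^T, R\mathbf{1})$.
Then on $\mathcal{R}$, $\frac{\partial G((1 - \epsilon) D(\pi) + \epsilon R,\, P)}{\partial \epsilon}\big|_{\epsilon = 0+} < -C < 0$ for all $(\pi, P) \in \Theta$.

If $\hat{c}^{(n)}$ is the maximizer of Q(s, A) and $\lambda_n / \log n \to \infty$, then, for all $(\pi, P) \in \Theta$,

$$\limsup_{n \to \infty} \frac{\log P(\hat{c}^{(n)} \ne c)}{\lambda_n} \le -s_Q(\pi, P) < 0.$$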

Proof of Theorem 1. Condition (C2) holds trivially and it is straightforward to check condition
(C3), so we only check the essential condition (C1). We have $W(S) = \rho_n F(O/\mu_n, f(s))$, where
$F(M, t) = M_{11}/t_1^2 - M_{12}/(t_1 t_2)$. Thus, the population version of the criterion can be written as

$$Q(R, P) = \frac{1}{(r_{11} + r_{12})^2}\left(r_{11}^2 p_{11} + 2 r_{11} r_{12} p_{12} + r_{12}^2 p_{22}\right) - \frac{1}{(r_{11} + r_{12})(r_{21} + r_{22})}\left(r_{11} r_{21} p_{11} + r_{11} r_{22} p_{12} + r_{12} r_{21} p_{12} + r_{12} r_{22} p_{22}\right).$$

We need to maximize this function over R under the constraint $R^T\mathbf{1} = (\pi, 1 - \pi)^T$. Taking the
transformation $t_1 = r_{11}/(r_{11} + r_{12})$, $t_2 = r_{22}/(r_{21} + r_{22})$, we obtain

$$f = p_{22} - p_{12} + (p_{11} - 2p_{12} + p_{22})\left[t_1(t_1 + t_2 - 1) - \tfrac{1}{2}(t_1 + t_2)\right] + \tfrac{1}{2}(p_{11} - p_{22})(t_1 + t_2).$$

It is easy to verify that the function $g(t_1, t_2) = t_1(t_1 + t_2 - 1) - (t_1 + t_2)/2$ has two maximizers,
$t_1 = 1, t_2 = 1$ and $t_1 = 0, t_2 = 0$. Thus under the condition $p_{11} - 2p_{12} + p_{22} > 0$, $p_{11} > p_{22}$, the
unique maximizer of f is $t_1 = 1$, $t_2 = 1$, or equivalently, $R = \mathrm{diag}(\pi, 1 - \pi)$. □
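
The change of variables in this proof can be spot-checked numerically. The sketch below compares the population criterion, written in terms of the entries of R, with the closed form in $(t_1, t_2)$ on randomly drawn inputs. It is a sanity check under the formulas as stated in this proof, not part of the proof itself, and the function names are assumptions. Because the criterion depends on each row of R only through ratios, the random matrices need not be normalized.

```python
import random


def population_criterion(R, p11, p12, p22):
    """Population version of criterion (3) in terms of the entries of R."""
    (r11, r12), (r21, r22) = R
    within = r11 * r11 * p11 + 2 * r11 * r12 * p12 + r12 * r12 * p22
    between = (r11 * r21 * p11 + r11 * r22 * p12
               + r12 * r21 * p12 + r12 * r22 * p22)
    return within / (r11 + r12) ** 2 - between / ((r11 + r12) * (r21 + r22))


def f_closed_form(t1, t2, p11, p12, p22):
    """Closed form after substituting t1 = r11/(r11+r12), t2 = r22/(r21+r22)."""
    g = t1 * (t1 + t2 - 1) - (t1 + t2) / 2
    return p22 - p12 + (p11 - 2 * p12 + p22) * g + 0.5 * (p11 - p22) * (t1 + t2)


def max_discrepancy(trials=1000, seed=0):
    """Largest absolute difference between the two expressions over random draws."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        R = [[rng.uniform(0.05, 1.0), rng.uniform(0.05, 1.0)] for _ in range(2)]
        p11, p12, p22 = rng.random(), rng.random(), rng.random()
        t1 = R[0][0] / (R[0][0] + R[0][1])
        t2 = R[1][1] / (R[1][0] + R[1][1])
        diff = abs(population_criterion(R, p11, p12, p22)
                   - f_closed_form(t1, t2, p11, p12, p22))
        worst = max(worst, diff)
    return worst
```

At the true assignment $t_1 = t_2 = 1$ the closed form reduces to $p_{11} - p_{12}$, and at $t_1 = t_2 = 0$ (labels swapped) to $p_{22} - p_{12}$, which is why the condition $p_{11} > p_{22}$ is needed to single out the correct maximizer.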

Proof of Theorem 2. Again, we only verify (C1), since (C2) and (C3) are straightforward. For
the adjusted criterion, we need to maximize

$$f = (r_{11} + r_{12})(r_{21} + r_{22})\left[\frac{1}{(r_{11} + r_{12})^2}\left(r_{11}^2 p_{11} + 2 r_{11} r_{12} p_{12} + r_{12}^2 p_{22}\right) - \frac{1}{(r_{11} + r_{12})(r_{21} + r_{22})}\left(r_{11} r_{21} p_{11} + r_{11} r_{22} p_{12} + r_{12} r_{21} p_{12} + r_{12} r_{22} p_{22}\right)\right]$$

under the constraint $R^T\mathbf{1} = (\pi, 1 - \pi)^T$. Applying the same transformation $t_1 = r_{11}/(r_{11} + r_{12})$,
$t_2 = r_{22}/(r_{21} + r_{22})$, we obtain

$$f = \frac{(t_1 - \pi)(t_2 - (1 - \pi))}{(t_1 + t_2 - 1)^2}\left\{ p_{22} - p_{12} + (p_{11} - 2p_{12} + p_{22})\left[t_1(t_1 + t_2 - 1) - \tfrac{1}{2}(t_1 + t_2)\right] + \tfrac{1}{2}(p_{11} - p_{22})(t_1 + t_2)\right\},$$

where $(t_1, t_2) \in [0, \pi] \times [0, 1 - \pi] \cup [\pi, 1] \times [1 - \pi, 1]$. The only interior point $t^*$ which potentially
satisfies $\nabla f(t^*) = 0$ is

$$t_1^* = \frac{p_{22} - p_{12}}{p_{11} + p_{22} - 2p_{12}}, \qquad t_2^* = \frac{p_{11} - p_{12}}{p_{11} + p_{22} - 2p_{12}}.$$

However, since $t_1^* + t_2^* = 1$, the only intersection with the feasible region is at $t_1 = \pi$, $t_2 = 1 - \pi$, and
thus f can only be maximized on the boundary of the feasible region. Since all functions involved
are monotone and convex, it is easy to check the boundary values; comparing all possible solutions
shows that the unique maximizer of f is $t_1 = 1$, $t_2 = 1$, or equivalently, $r_{11} = \pi$, $r_{22} = 1 - \pi$. □